In this notebook, we extend the HP separation approach from [Müller, FMP, Springer 2015] by considering an additional residual component. Furthermore, we introduce a cascaded version of the resulting approach. These two extensions were proposed in the following two articles:
One problem in harmonic–percussive (HP) separation is that a sound may contain noise-like events (e.g., applause, distorted guitar) that are neither of harmonic nor of percussive nature. Following Driedger et al. (see also Exercise 8.5 of [Müller, FMP, Springer 2015]), we introduce an extension to HP separation by considering a third residual component that captures the sounds lying "between" a clearly harmonic and a clearly percussive component. To this end, we consider an additional parameter $\beta\in\mathbb{R}$ with $\beta \geq 1$ called the separation factor. Let $\tilde{\mathcal{Y}}^\mathrm{h}$ and $\tilde{\mathcal{Y}}^\mathrm{p}$ be the filtered spectrograms as described in the notebook on HP separation (Section 8.1.1 of [Müller, FMP, Springer 2015]). Generalizing the definition of the binary masks used in HP separation, we define the binary masks $\mathcal{M}^\mathrm{h}$, $\mathcal{M}^\mathrm{p}$, and $\mathcal{M}^\mathrm{r}$ for the clearly harmonic, the clearly percussive, and the residual components by setting
\begin{eqnarray*} \mathcal{M}^\mathrm{h}(n,k) &:=& \begin{cases} 1 & \text{if } \tilde{\mathcal{Y}}^\mathrm{h}(n,k) \geq \beta\cdot \tilde{\mathcal{Y}}^\mathrm{p}(n,k), \\ 0 & \text{otherwise,} \end{cases} \\ \mathcal{M}^\mathrm{p}(n,k) &:=& \begin{cases} 1 & \text{if } \tilde{\mathcal{Y}}^\mathrm{p}(n,k) > \beta\cdot \tilde{\mathcal{Y}}^\mathrm{h}(n,k), \\ 0 & \text{otherwise,} \end{cases}\\ \mathcal{M}^\mathrm{r}(n,k) &:=& 1 - \big( \mathcal{M}^\mathrm{h}(n,k) + \mathcal{M}^\mathrm{p}(n,k) \big). \end{eqnarray*}Using these masks, one defines
\begin{eqnarray*} \mathcal{X}^\mathrm{h}(n,k) &:=& \mathcal{M}^\mathrm{h}(n,k) \cdot \mathcal{X}(n,k), \\ \mathcal{X}^\mathrm{p}(n,k) &:=& \mathcal{M}^\mathrm{p}(n,k) \cdot \mathcal{X}(n,k), \\ \mathcal{X}^\mathrm{r}(n,k) &:=& \mathcal{M}^\mathrm{r}(n,k) \cdot \mathcal{X}(n,k) \end{eqnarray*}for $n,k\in\mathbb{Z}$. The following figure shows a typical example for the resulting decomposition of a given STFT:
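To make the mask definitions concrete, here is a minimal NumPy sketch (using made-up toy values for the filtered spectrograms) showing that the three binary masks partition every time–frequency bin:

```python
import numpy as np

# Toy filtered spectrograms (assumed values, for illustration only)
Y_h = np.array([[4.0, 1.0], [2.0, 2.0]])
Y_p = np.array([[1.0, 4.0], [2.0, 2.0]])
beta = 2.0

M_h = (Y_h >= beta * Y_p).astype(int)  # clearly harmonic
M_p = (Y_p > beta * Y_h).astype(int)   # clearly percussive
M_r = 1 - (M_h + M_p)                  # residual

# Every bin is assigned to exactly one of the three masks
assert np.all(M_h + M_p + M_r == 1)
```

Note that the asymmetric use of $\geq$ and $>$ guarantees that no bin is assigned to both the harmonic and the percussive mask, so $\mathcal{M}^\mathrm{r}$ is indeed binary.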

Finally, one can derive time-domain signals $x^\mathrm{h}$, $x^\mathrm{p}$, and $x^\mathrm{r}$ by applying an inverse STFT. As shown by the following examples, the separation factor $\beta$ can be used to adjust the decomposition. The case $\beta=1$ corresponds to the original HP decomposition. By increasing $\beta$, fewer time–frequency bins are assigned to the reconstruction of the components $x^\mathrm{h}$ and $x^\mathrm{p}$, whereas more time–frequency bins are used for the reconstruction of the residual component $x^\mathrm{r}$.
import os
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import librosa.display
from scipy import signal
%matplotlib inline
def horizontal_median_filter(x, filter_len):
    return signal.medfilt(x, [1, filter_len])

def vertical_median_filter(x, filter_len):
    return signal.medfilt(x, [filter_len, 1])
def convert_L_h(L_h_sec, Fs=22050, N=1024, H=512):
    L_h = int(np.ceil(L_h_sec * Fs / H))
    if L_h % 2 == 0:
        L_h += 1
    return L_h

def convert_L_p(L_p_Hz, Fs=22050, N=1024, H=512):
    L_p = int(np.ceil(L_p_Hz * N / Fs))
    if L_p % 2 == 0:
        L_p += 1
    return L_p
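As a quick sanity check of these conversions, the following standalone sketch inlines the same formulas (assuming the default parameters Fs=22050, N=1024, H=512): a horizontal filter length of 0.2 sec maps to 9 frames, and a vertical filter length of 500 Hz maps to 25 bins (both rounded up to the next odd integer, as required by the median filter):

```python
import numpy as np

def to_odd(L):
    # Median filters require an odd kernel length
    return L + 1 if L % 2 == 0 else L

Fs, N, H = 22050, 1024, 512
L_h = to_odd(int(np.ceil(0.2 * Fs / H)))  # 0.2 sec -> number of frames
L_p = to_odd(int(np.ceil(500 * N / Fs)))  # 500 Hz -> number of bins
print(L_h, L_p)  # 9 25
```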
def HRPS(x, N, H, w, Fs, L_h_sec, L_p_Hz, beta=1):
    # x: Input signal
    # N: Frame length
    # H: Hopsize
    # w: Window function of length N
    # Fs: Sampling rate of x
    # L_h_sec: Horizontal median filter length given in seconds
    # L_p_Hz: Vertical median filter length given in Hertz
    # beta: Separation factor
    # STFT
    X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window=w, center=True, pad_mode='constant')
    # Power spectrogram
    Y = np.abs(X) ** 2
    # Median filtering
    L_h = convert_L_h(L_h_sec=L_h_sec, Fs=Fs, N=N, H=H)
    L_p = convert_L_p(L_p_Hz=L_p_Hz, Fs=Fs, N=N, H=H)
    Y_h = horizontal_median_filter(Y, L_h)
    Y_p = vertical_median_filter(Y, L_p)
    # Masking
    M_h = np.int8(Y_h >= beta * Y_p)
    M_p = np.int8(Y_p > beta * Y_h)
    M_r = 1 - (M_h + M_p)
    X_h = X * M_h
    X_p = X * M_p
    X_r = X * M_r
    # Inverse STFT
    x_h = librosa.istft(X_h, hop_length=H, win_length=N, window=w, center=True, length=x.size)
    x_p = librosa.istft(X_p, hop_length=H, win_length=N, window=w, center=True, length=x.size)
    x_r = librosa.istft(X_r, hop_length=H, win_length=N, window=w, center=True, length=x.size)
    return x_h, x_p, x_r
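Since the three masks sum to one at every bin, the masked STFTs reconstruct the original STFT exactly. The following self-contained sketch reproduces the core of HRPS (median filtering plus masking) on a random toy "STFT" and verifies this property; it deliberately omits the librosa-dependent transform steps so it runs without audio data:

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 40)) + 1j * rng.normal(size=(65, 40))  # toy complex STFT
Y = np.abs(X) ** 2                                              # power spectrogram

Y_h = signal.medfilt(Y, [1, 11])   # horizontal (harmonic) median filter
Y_p = signal.medfilt(Y, [17, 1])   # vertical (percussive) median filter

beta = 2.0
M_h = np.int8(Y_h >= beta * Y_p)
M_p = np.int8(Y_p > beta * Y_h)
M_r = 1 - (M_h + M_p)

X_h, X_p, X_r = X * M_h, X * M_p, X * M_r
# The binary masks partition all bins, so the components sum to X exactly
assert np.allclose(X_h + X_p + X_r, X)
```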
Intuitively, the larger the parameter $\beta$, the clearer becomes the harmonic and percussive nature of the components $x^\mathrm{h}$ and $x^\mathrm{p}$. For very large $\beta$, the residual signal $x^\mathrm{r}$ tends to contain the entire signal $x$. This is illustrated by the following figure and the code example.

import pandas as pd
from collections import OrderedDict
def generate_audio_tag_html_list(sinusoid_list, Fs):
    audio_tag_html_list = []
    for i in range(len(sinusoid_list)):
        audio_tag = ipd.Audio(sinusoid_list[i], rate=Fs)
        audio_tag_html = audio_tag._repr_html_().replace('\n', '').strip()
        audio_tag_html = audio_tag_html.replace('<audio ',
                                                '<audio style="width: 100px; height: 40px;"')
        audio_tag_html_list.append(audio_tag_html)
    return audio_tag_html_list
Fs = 22050
x, Fs = librosa.load(os.path.join('..', 'data', 'C8', 'FMP_C8_F02_Long_CastanetsViolinApplause.wav'), sr=Fs)
param_list = [
    [1024, 256, 0.2, 500, 1],
    [1024, 256, 0.2, 500, 2],
    [1024, 256, 0.2, 500, 4],
    [1024, 256, 0.2, 500, 8]
]
list_x = []
list_x_h = []
list_x_p = []
list_x_r = []
list_N = []
list_H = []
list_L_h_sec = []
list_L_p_Hz = []
list_L_h = []
list_L_p = []
list_beta = []
for param in param_list:
    N, H, L_h_sec, L_p_Hz, beta = param
    print('N=%4d, H=%4d, L_h_sec=%4.2f, L_p_Hz=%3.1f, beta=%3.1f' % (N, H, L_h_sec, L_p_Hz, beta))
    x_h, x_p, x_r = HRPS(x, N=N, H=H, w='hann', Fs=Fs, L_h_sec=L_h_sec, L_p_Hz=L_p_Hz, beta=beta)
    L_h = convert_L_h(L_h_sec=L_h_sec, Fs=Fs, N=N, H=H)
    L_p = convert_L_p(L_p_Hz=L_p_Hz, Fs=Fs, N=N, H=H)
    list_x.append(x)
    list_x_h.append(x_h)
    list_x_p.append(x_p)
    list_x_r.append(x_r)
    list_N.append(N)
    list_H.append(H)
    list_L_h_sec.append(L_h_sec)
    list_L_p_Hz.append(L_p_Hz)
    list_L_h.append(L_h)
    list_L_p.append(L_p)
    list_beta.append(beta)
html_x = generate_audio_tag_html_list(list_x, Fs=Fs)
html_x_h = generate_audio_tag_html_list(list_x_h, Fs=Fs)
html_x_p = generate_audio_tag_html_list(list_x_p, Fs=Fs)
html_x_r = generate_audio_tag_html_list(list_x_r, Fs=Fs)
pd.options.display.float_format = '{:,.1f}'.format
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(OrderedDict([
    ('$N$', list_N),
    ('$H$', list_H),
    ('$L_h$ (sec)', list_L_h_sec),
    ('$L_p$ (Hz)', list_L_p_Hz),
    ('$L_h$', list_L_h),
    ('$L_p$', list_L_p),
    ('$\\beta$', list_beta),
    ('$x$', html_x),
    ('$x_h$', html_x_h),
    ('$x_r$', html_x_r),
    ('$x_p$', html_x_p)]))
df.index = np.arange(1, len(df) + 1)
ipd.HTML(df.to_html(escape=False, index=False))
The HRP separation procedure was further extended by López-Serrano et al., who introduced a cascaded version of it. This yields a mid-level feature representation for analyzing musical phenomena such as percussive event density, timbral changes, and homogeneous structural segments. The following figure illustrates the cascaded harmonic–residual–percussive (CHRP) separation procedure. First, using a large separation factor $\beta$ (e.g., $\beta=5$), a signal is separated into harmonic (H), residual (R), and percussive (P) components. In the second and third rows, the residual component from the previous pass is separated into H, R, and P components, using decreasing separation factors. The last row shows how the cascade components can be ordered along an HRP axis.

In general, when using $B\in\mathbb{N}$ cascading stages with decreasing separation factors $\beta_1 > \beta_2 > \ldots > \beta_B$, the CHRP procedure produces $(2B+1)$ component signals, whose sum is equal to the input signal $x$ (up to a small error). Once all cascading stages are complete, the component signals are sorted on an axis that goes from harmonic, through residual, to percussive. For example, using $B = 3$, the CHRP procedure produces seven component signals referred to as H, RH, RRH, RRR, RRP, RP, and P.
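The cascading control flow can be sketched as follows. Here `hrp` is a hypothetical placeholder for a single HRP separation pass (in the real procedure this would be the HRPS function above), chosen so that the sketch runs without audio data while preserving the defining property that the three outputs sum to the input:

```python
def hrp(x, beta):
    # Placeholder for one HRP pass; a real implementation would use
    # median filtering and masking. Outputs sum exactly to the input.
    h = [v / (2 + beta) for v in x]
    p = [v / (2 + beta) for v in x]
    r = [v - hv - pv for v, hv, pv in zip(x, h, p)]
    return h, r, p

def chrp(x, betas):
    # Cascade: re-separate the residual with decreasing separation factors
    harmonic_side, percussive_side = [], []
    residual = x
    for beta in betas:
        h, residual, p = hrp(residual, beta)
        harmonic_side.append(h)       # H, RH, RRH, ...
        percussive_side.insert(0, p)  # ..., RRP, RP, P
    # Order along the HRP axis: H, RH, RRH, RRR, RRP, RP, P (for B = 3)
    return harmonic_side + [residual] + percussive_side

components = chrp([1.0, 2.0, 3.0], betas=[5, 3, 2])
assert len(components) == 7  # 2B + 1 components for B = 3
```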
Motivated by tasks such as musical event density and structure analysis, we now introduce a mid-level feature representation. To this end, we compute the local energy of each of the $2B + 1$ component signals using a sliding window technique with a window $w^\mathrm{e}$ of length $N^\mathrm{e}\in\mathbb{N}$ and hop size $H^\mathrm{e}\in\mathbb{N}$. Let $M^\mathrm{e}\in\mathbb{N}$ be the number of energy frames. We then stack the local energy values of all component signals into a feature matrix $V = (v_1, \ldots, v_{M^\mathrm{e}})$ such that $v_m \in \mathbb{R}^{2B+1}$ for $m \in [1:M^\mathrm{e}]$. Furthermore, the columns $v_m$ may be normalized using, e.g., the $\ell\text{--}1$ norm. Then each $v_m \in [0,1]^{2B+1}$ expresses the energy distribution across the components for frame $m \in [1:M^\mathrm{e}]$.
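A minimal sketch of this feature computation, using a rectangular window, assumed window parameters, and random placeholder component signals in place of actual CHRP outputs:

```python
import numpy as np

def local_energy(x, N_e=441, H_e=220):
    # Local energy via a sliding rectangular window of length N_e, hop size H_e
    num_frames = 1 + (len(x) - N_e) // H_e
    return np.array([np.sum(x[m * H_e : m * H_e + N_e] ** 2)
                     for m in range(num_frames)])

B = 3
rng = np.random.default_rng(1)
components = [rng.normal(size=4410) for _ in range(2 * B + 1)]  # placeholder signals

V = np.stack([local_energy(c) for c in components])      # shape: (2B+1, M_e)
V = V / np.maximum(V.sum(axis=0, keepdims=True), 1e-12)  # l1-normalize each column
assert np.allclose(V.sum(axis=0), 1.0)                   # columns are distributions
```

Each column of `V` then describes how the signal's energy is distributed over the seven components in that frame.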
In the following example, we show the CHRP feature matrix for a signal consisting of five non-overlapping sound samples: castanets, snare roll, applause, staccato strings, and legato violin. The castanets have the highest percussive energy and are well confined to the P component. The snare roll is predominantly composed of RP and RRR. Indeed, snare drums have a percussive attack and a decay curve that is both noisy and tonal (pitched according to the drum's tuning frequency). When struck in rapid succession, the noisy decay tails overpower the percussive onsets. Applause is centered around RRR and well confined to the residual region. The staccato strings are predominantly harmonic, with an additional RH component that corresponds to the noisy attacks emerging in this playing technique. The legato violin is confined to the H component, since the stable, harmonic signal properties dominate all other components.

The next example shows energy migrating from percussive to residual in a snare-playing technique known as paradiddles. The next figure shows the waveform of paradiddles played on a snare drum, first with increasing speed (0–40 sec) and then with decreasing speed (40–75 sec). In the resulting CHRP feature matrix, one can notice that very little P energy remains once a certain onset frequency or playing speed has been reached (around 25 sec). This is due to the fact that the noise-like decay tails come so close together that they overpower the individual percussive onsets, centering the energy on the residual components of the feature matrix.

Acknowledgment: This notebook was created by Meinard Müller.